Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).

Data

Loading data

TrainSet <- read.csv("pml-training.csv", stringsAsFactors = F,na.strings = c("","NA","#DIV/0!"))
TestSet <- read.csv("pml-testing.csv", stringsAsFactors = F,na.strings = c("","NA","#DIV/0!"))
dim(TrainSet); dim(TestSet)
## [1] 19622   160
## [1]  20 160

Validation Model:

set.seed(101)
inTrain <- createDataPartition(TrainSet$classe, p = 0.8, list = F)
DatValue <- TrainSet[-inTrain,]
TrainSet <- TrainSet[inTrain,]
dim(TrainSet); dim(DatValue)
## [1] 15699   160
## [1] 3923  160
table(TrainSet$classe)/nrow(TrainSet)
## 
##         A         B         C         D         E 
## 0.2843493 0.1935155 0.1744060 0.1638958 0.1838334

The data is not much baised with respect to different classes

Missing Data

Belt

For Belt sensor:

belt_miss <- sapply(select(TrainSet,names(TrainSet)[grepl("_belt",names(TrainSet))]),
                    function(x) sum(is.na(x)))
belt_miss
##            roll_belt           pitch_belt             yaw_belt 
##                    0                    0                    0 
##     total_accel_belt   kurtosis_roll_belt  kurtosis_picth_belt 
##                    0                15396                15413 
##    kurtosis_yaw_belt   skewness_roll_belt skewness_roll_belt.1 
##                15699                15395                15413 
##    skewness_yaw_belt        max_roll_belt       max_picth_belt 
##                15699                15388                15388 
##         max_yaw_belt        min_roll_belt       min_pitch_belt 
##                15396                15388                15388 
##         min_yaw_belt  amplitude_roll_belt amplitude_pitch_belt 
##                15396                15388                15388 
##   amplitude_yaw_belt var_total_accel_belt        avg_roll_belt 
##                15396                15388                15388 
##     stddev_roll_belt        var_roll_belt       avg_pitch_belt 
##                15388                15388                15388 
##    stddev_pitch_belt       var_pitch_belt         avg_yaw_belt 
##                15388                15388                15388 
##      stddev_yaw_belt         var_yaw_belt         gyros_belt_x 
##                15388                15388                    0 
##         gyros_belt_y         gyros_belt_z         accel_belt_x 
##                    0                    0                    0 
##         accel_belt_y         accel_belt_z        magnet_belt_x 
##                    0                    0                    0 
##        magnet_belt_y        magnet_belt_z 
##                    0                    0

Arm

For Arm sensor:

arm_miss <- sapply(select(TrainSet,names(TrainSet)[grepl("_arm",names(TrainSet))]),
                   function(x) sum(is.na(x)))
arm_miss
##            roll_arm           pitch_arm             yaw_arm     total_accel_arm 
##                   0                   0                   0                   0 
##       var_accel_arm        avg_roll_arm     stddev_roll_arm        var_roll_arm 
##               15388               15388               15388               15388 
##       avg_pitch_arm    stddev_pitch_arm       var_pitch_arm         avg_yaw_arm 
##               15388               15388               15388               15388 
##      stddev_yaw_arm         var_yaw_arm         gyros_arm_x         gyros_arm_y 
##               15388               15388                   0                   0 
##         gyros_arm_z         accel_arm_x         accel_arm_y         accel_arm_z 
##                   0                   0                   0                   0 
##        magnet_arm_x        magnet_arm_y        magnet_arm_z   kurtosis_roll_arm 
##                   0                   0                   0               15446 
##  kurtosis_picth_arm    kurtosis_yaw_arm   skewness_roll_arm  skewness_pitch_arm 
##               15448               15398               15445               15448 
##    skewness_yaw_arm        max_roll_arm       max_picth_arm         max_yaw_arm 
##               15398               15388               15388               15388 
##        min_roll_arm       min_pitch_arm         min_yaw_arm  amplitude_roll_arm 
##               15388               15388               15388               15388 
## amplitude_pitch_arm   amplitude_yaw_arm 
##               15388               15388

Forearm

For Forearm sensor:

forearm_miss <- sapply(select(TrainSet,
                              names(TrainSet)[grepl("_forearm",names(TrainSet))]),
                       function(x) sum(is.na(x)))
forearm_miss
##            roll_forearm           pitch_forearm             yaw_forearm 
##                       0                       0                       0 
##   kurtosis_roll_forearm  kurtosis_picth_forearm    kurtosis_yaw_forearm 
##                   15448                   15449                   15699 
##   skewness_roll_forearm  skewness_pitch_forearm    skewness_yaw_forearm 
##                   15447                   15449                   15699 
##        max_roll_forearm       max_picth_forearm         max_yaw_forearm 
##                   15388                   15388                   15448 
##        min_roll_forearm       min_pitch_forearm         min_yaw_forearm 
##                   15388                   15388                   15448 
##  amplitude_roll_forearm amplitude_pitch_forearm   amplitude_yaw_forearm 
##                   15388                   15388                   15448 
##     total_accel_forearm       var_accel_forearm        avg_roll_forearm 
##                       0                   15388                   15388 
##     stddev_roll_forearm        var_roll_forearm       avg_pitch_forearm 
##                   15388                   15388                   15388 
##    stddev_pitch_forearm       var_pitch_forearm         avg_yaw_forearm 
##                   15388                   15388                   15388 
##      stddev_yaw_forearm         var_yaw_forearm         gyros_forearm_x 
##                   15388                   15388                       0 
##         gyros_forearm_y         gyros_forearm_z         accel_forearm_x 
##                       0                       0                       0 
##         accel_forearm_y         accel_forearm_z        magnet_forearm_x 
##                       0                       0                       0 
##        magnet_forearm_y        magnet_forearm_z 
##                       0                       0

Dumbbell

For Dumbbell sensor:

dumbbell_miss <- sapply(select(TrainSet,
                               names(TrainSet)[grepl("_dumbbell",names(TrainSet))]),
                        function(x) sum(is.na(x)))
dumbbell_miss
##            roll_dumbbell           pitch_dumbbell             yaw_dumbbell 
##                        0                        0                        0 
##   kurtosis_roll_dumbbell  kurtosis_picth_dumbbell    kurtosis_yaw_dumbbell 
##                    15392                    15390                    15699 
##   skewness_roll_dumbbell  skewness_pitch_dumbbell    skewness_yaw_dumbbell 
##                    15391                    15389                    15699 
##        max_roll_dumbbell       max_picth_dumbbell         max_yaw_dumbbell 
##                    15388                    15388                    15392 
##        min_roll_dumbbell       min_pitch_dumbbell         min_yaw_dumbbell 
##                    15388                    15388                    15392 
##  amplitude_roll_dumbbell amplitude_pitch_dumbbell   amplitude_yaw_dumbbell 
##                    15388                    15388                    15392 
##     total_accel_dumbbell       var_accel_dumbbell        avg_roll_dumbbell 
##                        0                    15388                    15388 
##     stddev_roll_dumbbell        var_roll_dumbbell       avg_pitch_dumbbell 
##                    15388                    15388                    15388 
##    stddev_pitch_dumbbell       var_pitch_dumbbell         avg_yaw_dumbbell 
##                    15388                    15388                    15388 
##      stddev_yaw_dumbbell         var_yaw_dumbbell         gyros_dumbbell_x 
##                    15388                    15388                        0 
##         gyros_dumbbell_y         gyros_dumbbell_z         accel_dumbbell_x 
##                        0                        0                        0 
##         accel_dumbbell_y         accel_dumbbell_z        magnet_dumbbell_x 
##                        0                        0                        0 
##        magnet_dumbbell_y        magnet_dumbbell_z 
##                        0                        0

Few of the features are over 90% missing therefore it is better to drop those, after the drop there are 52 predictors left.

column_2drop <- c(names(belt_miss[belt_miss != 0]), 
                  names(arm_miss[arm_miss != 0]),
                  names(forearm_miss[forearm_miss != 0]),
                  names(dumbbell_miss[dumbbell_miss != 0]))
length(column_2drop)
## [1] 100

Analysis

#dropping the cols
Anaylsis <- tbl_df(TrainSet %>%  select(-column_2drop,
                             -c(X,user_name, raw_timestamp_part_1, 
                                raw_timestamp_part_2, cvtd_timestamp, 
                                new_window,num_window)))
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(column_2drop)` instead of `column_2drop` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
Anaylsis$classe <- as.factor(Anaylsis$classe)
Anaylsis[,1:52] <- lapply(Anaylsis[,1:52],as.numeric)
dim(Anaylsis)
## [1] 15699    53

Correlation

corr_col <- cor(select(Anaylsis, -classe))
diag(corr_col) <- 0
corr_col <- which(abs(corr_col)>0.8,arr.ind = T)
corr_col <- unique(row.names(corr_col))
corrplot(cor(select(Anaylsis,corr_col)),
         type="upper", order="hclust",method = "number")
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(corr_col)` instead of `corr_col` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

Correlation plot makes it clear that there are a lot of columns that are highly correlated and high correlation is only seen between the same sensor i.e. “belt”,“arm”,“forearm” and “dumbbell”.

Targeted Correlation

We use correlationfunnel::correlate to see the correlation with each level of“classe” and other features

correlation_funl_df <- Anaylsis %>% binarize(n_bins = 4, thresh_infreq = 0.01)

classe__A

correlation_a <- correlation_funl_df %>% correlate(target = classe__A) 
correlation_a %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))

For *classe__A* the “Arm and Forearm” sensors are more important.

  • “accel_arm_x” is correlated with “magnet_arm_x” and “gyros_arm_y” is correlated with “gyros_arm_x”, so they wont be considered.
  • Top 5 significant features for "classe__A" are - (magnet_arm_x, pitch_forearm , magnet_dumbbell_y, roll_forearm, gyros_dumbbell_y)

classe__B

correlation_b <- correlation_funl_df %>% correlate(target = classe__B)
correlation_b %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))

For *classe__B* the “Dumbbell and Belt” sensors are more important.

  • Top 5 significant features for "classe__A" are - (magnet_dumbbell_y, magnet_dumbbell_x , roll_dumbbell , magnet_belt_y , accel_dumbbell_x )

classe__C

correlation_c <- correlation_funl_df %>% correlate(target = classe__C)
correlation_c %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))

For *classe__C* the “Dumbbell” sensors are more important.

  • Top 5 significant features for "classe__A" are - (magnet_dumbbell_y, roll_dumbbell , accel_dumbbell_y , magnet_dumbbell_x, magnet_dumbbell_z)

classe__D

correlation_d <- correlation_funl_df %>% correlate(target = classe__D)
correlation_d %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))

For *classe__D* the “Forearm, Arm and Dumbbell” sensors are more important.

  • Top 5 significant features for "classe__A" are - (pitch_forearm , magnet_arm_y , magnet_forearm_x, accel_dumbbell_y, accel_forearm_x)

classe__E

correlation_e <- correlation_funl_df %>% correlate(target = classe__E)
correlation_e %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))

For *classe__E* the “Belt” sensors are more important.

  • “total_accel_belt” is correlated with “roll_belt” ,“yaw_belt” is correlated with “roll_belt” ,“accel_belt_z” is correlated with “roll_belt”, so we wont consider them.
  • Top 5 significant features for "classe__A" are - (magnet_belt_y , magnet_belt_z , roll_belt, gyros_belt_z , magnet_dumbbell_y)

Plots

Top 5 features are presented in these columns

col_a <- c("magnet_arm_x", "pitch_forearm" , "magnet_dumbbell_y", 
           "roll_forearm", "gyros_dumbbell_y") 
col_b <- c("magnet_dumbbell_y", "magnet_dumbbell_x" , "roll_dumbbell" , 
           "magnet_belt_y" , "accel_dumbbell_x" )
col_c <- c("magnet_dumbbell_y", "roll_dumbbell" , "accel_dumbbell_y" , 
           "magnet_dumbbell_x", "magnet_dumbbell_z")
col_d <- c("pitch_forearm" , "magnet_arm_y" , "magnet_forearm_x",
           "accel_dumbbell_y", "accel_forearm_x")
col_e <- c("magnet_belt_y" , "magnet_belt_z" , "roll_belt", 
           "gyros_belt_z" , "magnet_dumbbell_y")
final_cols <- character()
for(c in c(col_a,col_b,col_c,col_d,col_e)){
  final_cols <- union(final_cols, c)
}
Anaylsis2 <- Anaylsis %>% select(final_cols, classe)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(final_cols)` instead of `final_cols` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
data.frame("arm" = sum(grepl("_arm",final_cols)), 
           "forearm" = sum(grepl("_forearm",final_cols)),
           "belt" = sum(grepl("_belt",final_cols)),
           "dumbbell" = sum(grepl("_dumbbell",final_cols)))
##   arm forearm belt dumbbell
## 1   2       4    4        7
TrainSet <- TrainSet %>%  select(final_cols,classe)
DatValue <- DatValue %>%  select(final_cols,classe)
TrainSet[,1:17] <- sapply(TrainSet[,1:17],as.numeric)
DatValue[,1:17] <- sapply(DatValue[,1:17],as.numeric)
levels <- c("A", "B", "C", "D", "E")
preprop_obj <- preProcess(TrainSet[,-18],method = c("center","scale","BoxCox"))
xTrain <- predict(preprop_obj,select(TrainSet,-classe))
yTrain <- factor(TrainSet$classe,levels=levels)
xVal <- predict(preprop_obj,select(DatValue,-classe))
yVal <- factor(DatValue$classe,levels=levels)
trControl <- trainControl(method="cv", number=5)

Dumbbell sensor turned out to be the most important sensor among the 4

Pairs plot

my_dens <- function(data, mapping, ...) {
  ggplot(data = data, mapping=mapping) +
    geom_density(..., alpha = 0.3)+scale_fill_brewer(palette="Set2") 
}
my_point <- function(data, mapping, ...) {
  ggplot(data = data, mapping=mapping) +
    geom_point(..., alpha = 0.1)+ scale_fill_brewer(palette="Set2") 
}
ggpairs(Anaylsis2, columns = 1:5,aes(color = classe),
        lower = list(continuous = my_point),diag = list(continuous = my_dens))

ggpairs(Anaylsis2, columns = 6:10,aes(color = classe),
        lower = list(continuous = my_point),diag = list(continuous = my_dens))

ggpairs(Anaylsis2, columns = 11:17,aes(color = classe),
        lower = list(continuous = my_point),diag = list(continuous = my_dens))

Most of the features are very skewed, so as a preprocessing step we have to “center”, “rescale” and use “BoxCox” the features.

Modelling

Random Forest

modelRF <- train(x = xTrain,y = yTrain, 
                 method = "rf", trControl = trControl,verbose=FALSE, metric = "Accuracy")
confusionMatrix(predict(modelRF,xVal),yVal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1112    8    0    0    0
##          B    3  741    6    0    1
##          C    1    7  675   15    4
##          D    0    3    3  627    1
##          E    0    0    0    1  715
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9865          
##                  95% CI : (0.9824, 0.9899)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9829          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9763   0.9868   0.9751   0.9917
## Specificity            0.9971   0.9968   0.9917   0.9979   0.9997
## Pos Pred Value         0.9929   0.9867   0.9615   0.9890   0.9986
## Neg Pred Value         0.9986   0.9943   0.9972   0.9951   0.9981
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2835   0.1889   0.1721   0.1598   0.1823
## Detection Prevalence   0.2855   0.1914   0.1789   0.1616   0.1825
## Balanced Accuracy      0.9968   0.9866   0.9893   0.9865   0.9957
plot(modelRF$finalModel,main="Error VS no of tree")

Random Forest 98%+ accuracy was the highest.

Classification Tree

modelCT <- train(x = xTrain,y = yTrain, 
                 method = "rpart", trControl = trControl)
confusionMatrix(predict(modelCT,xVal),yVal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1003  330  319  294  106
##          B   19  256   20  109  103
##          C   93  173  345  240  212
##          D    0    0    0    0    0
##          E    1    0    0    0  300
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4853          
##                  95% CI : (0.4696, 0.5011)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3271          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8987  0.33729  0.50439   0.0000  0.41609
## Specificity            0.6263  0.92067  0.77833   1.0000  0.99969
## Pos Pred Value         0.4888  0.50493  0.32455      NaN  0.99668
## Neg Pred Value         0.9396  0.85275  0.88147   0.8361  0.88377
## Prevalence             0.2845  0.19347  0.17436   0.1639  0.18379
## Detection Rate         0.2557  0.06526  0.08794   0.0000  0.07647
## Detection Prevalence   0.5231  0.12924  0.27097   0.0000  0.07673
## Balanced Accuracy      0.7625  0.62898  0.64136   0.5000  0.70789

Classification tree’s accuracy is very low.

SVM

modelSVM <- svm(x = xTrain,y = yTrain,
                kernel = "polynomial", cost = 10)
confusionMatrix(predict(modelSVM,xVal),yVal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1096   40   18   17    2
##          B    1  676   15    5    6
##          C    9   40  640   45    3
##          D   10    3    9  575    9
##          E    0    0    2    1  701
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9401          
##                  95% CI : (0.9322, 0.9473)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9241          
##                                           
##  Mcnemar's Test P-Value : 1.808e-15       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9821   0.8906   0.9357   0.8942   0.9723
## Specificity            0.9726   0.9915   0.9701   0.9905   0.9991
## Pos Pred Value         0.9344   0.9616   0.8684   0.9488   0.9957
## Neg Pred Value         0.9927   0.9742   0.9862   0.9795   0.9938
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2794   0.1723   0.1631   0.1466   0.1787
## Detection Prevalence   0.2990   0.1792   0.1879   0.1545   0.1795
## Balanced Accuracy      0.9773   0.9411   0.9529   0.9424   0.9857

It worked great but RM was better

Results

So from the above analysis it is clear that Random Forest is taking the lead in term of prediction.